class: center, middle, inverse, title-slide

# Introduction to Survey Data Cleaning Using Tidyverse in R
## Data Wrangling - Part 2
### Johannes Breuer, Stefan Jünger
### 2021-07-22

---
layout: true

<div class="my-footer">
  <div style="float: left;"><span>Johannes Breuer, Stefan Jünger</span></div>
  <div style="float: right;"><span>ESRA 2021, 2021-07-22</span></div>
  <div style="text-align: center;"><span>Data Wrangling - Part 2</span></div>
</div>

---

## Data wrangling continued 🤠

While in the last session we focused on changing the structure of our data by **selecting**, **renaming**, and **relocating** columns and **filtering** and **arranging** rows, in this part we will focus on altering the content of data sets by *adding* and *changing* variables and variable values.

More specifically, we will deal with...

- creating and computing new variables (in various ways)
- recoding the values of a variable
- dealing with missing values

---

## `dplyr::mutate()`

<img src="data:image/png;base64,#./pics/dplyr_mutate.png" width="60%" style="display: block; margin: auto;" />

<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## Creating a new variable

A simple example for creating a new variable is adding a numeric ID variable based on the row number in the data set.

.small[
```r
gpc <- gpc %>% 
* mutate(id = row_number()) %>% 
  relocate(id, .before = everything()) # move the id column before all other columns

gpc %>% 
  select(1:5) %>% 
  glimpse
```

```
## Rows: 3,765
## Columns: 5
## $ id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ cohort        <dbl> 2, 1, 1, 2, 3, 2, 1, 2, 2, 2, 1, 3, 3, 1, 2, 1, 1, 1, 3,~
## $ sex           <dbl> 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1,~
## $ age_cat       <dbl> 10, 2, 8, 1, 7, 7, 7, 7, 8, 6, 9, 7, 2, 2, 7, 7, 7, 4, 1~
## $ education_cat <dbl> 3, 3, 1, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 3, 2, 2, 2,~
```
]

*Note*: The function `rowid_to_column()` from the `tibble` package is an alternative for this which also automatically includes the id variable as the first column.
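
As a minimal sketch of the alternative mentioned in the note, the two approaches can be compared side by side. The data frame `toy` and its values are hypothetical stand-ins for `gpc`, used only so that the example is self-contained.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for gpc (values are made up)
toy <- tibble(
  cohort = c(2, 1, 1),
  sex    = c(1, 2, 1)
)

# Variant shown on the slide: row_number() followed by relocate()
toy_a <- toy %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = everything())

# Alternative: rowid_to_column() adds the id and places it first in one step
toy_b <- toy %>%
  rowid_to_column(var = "id")

names(toy_a)
names(toy_b)
```

Both variants should yield an integer `id` in the first column; `rowid_to_column()` simply saves the separate `relocate()` step.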
---

## Recoding values

Very often we want to recode values in a variable (e.g., if we have reverse-scored items as part of a scale).

Say, for example, you want to recode the item from the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* that measures trust in scientists with regard to dealing with the Coronavirus to represent distrust.

.small[
```r
gpc <- gpc %>% 
  mutate(hzcy052aR = recode(hzcy052a,
                            `5` = 1, # `old value` = new value
                            `4` = 2,
                            `2` = 4,
                            `1` = 5))

gpc %>% 
  select(hzcy052a, hzcy052aR) %>% 
  head()
```

```
## # A tibble: 6 x 2
##   hzcy052a hzcy052aR
##      <dbl>     <dbl>
## 1        5         1
## 2        5         1
## 3        4         2
## 4       NA        NA
## 5        5         1
## 6       98        98
```
]

---

## Missing values

Most of the real datasets we work with have missing data. As the data can be missing for various reasons, we often use codes (and labels) to distinguish between different types of missing data.

If you look at the [codebook](https://dbk.gesis.org/dbksearch/download.asp?id=67378) of the *GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* or the [*GESIS Panel* Cheatsheet](https://www.gesis.org/fileadmin/upload/GESIS_Panel/Cheatsheet/gesis_panel_cheatsheet.pdf), you will see that there are quite a few types of and codes for missing data. Some types of missing values are the same across variables, while some variables also have additional types of missing data (and, hence, additional codes for missings).
Notably, however, in the process of creating the synthetic data we use in this course, all values < 0 have been changed to `NA`.<sup>1</sup>

.footnote[
[1] `NA` is a reserved term in `R`, meaning that you cannot use it as a name for anything else (this is also the case for `TRUE` and `FALSE`).
]

---

## Wrangling missing values

When we prepare our data for analysis, there are generally two things we might want/have to do with regard to missing values:

- define specific values as missings (i.e., set them to `NA`)
- recode `NA` values into something else (typically to distinguish between different types of missing values)

---
class: center, middle

# [Exercise](https://jobreu.github.io/tidyverse-workshop-esra-2021/exercises/DataWrangling2_question.html) time 🏋️‍♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/tidyverse-workshop-esra-2021/solutions/DataWrangling2_solution.html)
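
---

## Wrangling missing values: a sketch

The two operations listed on the *Wrangling missing values* slide could look roughly like this with `dplyr::na_if()` and `tidyr::replace_na()`. This is only a sketch: the data frame `toy`, the column `trust`, the treatment of `98` as a missing code, and the replacement code `-99` are all assumptions made for illustration and are not taken from the codebook.

.small[
```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; values and codes are illustrative only
toy <- tibble(
  trust = c(5, 4, NA, 98, 1)
)

toy <- toy %>%
  mutate(
    # 1) define a specific value as missing (assumed code: 98)
    trust_na = na_if(trust, 98),
    # 2) recode NA into an explicit, hypothetical missing code
    trust_coded = replace_na(trust, -99)
  )

toy
```
]

Writing the results to new columns instead of overwriting `trust` keeps the original values available for checking.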